1. Introduction

The Lasso is an $\ell_1$ penalized least squares estimator in linear regression
models proposed by Tibshirani [17]. The Lasso enjoys two important properties.
First, it is naturally sparse, i.e., it has a large number of zero components.
Second, it is computationally feasible even for high-dimensional data (Efron et
al.; Osborne et al. [16]) whereas classical procedures such as BIC are not
feasible when the number of parameters becomes large. The first property raises
the question of model selection consistency of the Lasso, i.e., of
identification of the subset of non-zero parameters. A closely related problem
is sign consistency, i.e., identification of the non-zero parameters and their
signs (cf. Bunea; Meinshausen and Bühlmann; Meinshausen and Yu [14];
Wainwright [20]; Zhao and Yu [22] and the references cited in these papers).

Zou [23] has proved estimation and variable selection results for the adaptive
Lasso: a variant of the Lasso where the weights on the different components in
the $\ell_1$ penalty vary and are data dependent. We mention also work on the
convergence of the Lasso estimator under the prediction loss: Bickel, Ritov and
Tsybakov [1]; Bunea, Tsybakov and Wegkamp; Greenshtein and Ritov; Koltchinskii;
Van der Geer.



Knight and Fu [10] have proved the estimation consistency of the Lasso
estimator in the case where the number of parameters is fixed and smaller than
the sample size. The $\ell_2$ consistency of the Lasso with convergence rate has
been proved in Bickel, Ritov and Tsybakov [1], Meinshausen and Yu [14], Zhang
and Huang [21]. These results trivially imply the $\ell_p$ consistency, with
$2 \le p \le \infty$, however with a suboptimal rate (cf., e.g., Theorem 3 in
[21]). Bickel, Ritov and Tsybakov [1] have proved that the Dantzig selector of
Candes and Tao shares a lot of common properties with the Lasso. In particular
they have shown simultaneous $\ell_p$ consistency with rates of the Lasso and
Dantzig estimators for $1 \le p \le 2$. To our knowledge, there is no result on
the $\ell_\infty$ convergence rate and sign consistency of the Dantzig
estimator.

The notion of $\ell_\infty$ and sign consistency should be properly defined when
the number of parameters is larger than the sample size. We may have indeed an
infinity of possible target vectors and solutions to the Lasso and Dantzig
minimization problems. This difficulty is not discussed in [14; 20; 21], among
others, where either the target vector or the Lasso estimator or both are
assumed to be unique. We show that under a sparsity scenario, it is possible to
derive $\ell_\infty$ and sign consistency results even when the number of
parameters is larger than the sample size. We refer to Theorem 6.3 and the
Remark 1, p. 21, in [1] which suggest a way to clarify the difficulty mentioned
above.

In this paper, we consider a high-dimensional linear regression model where
the number of parameters can be much greater than the sample size. We show
that under a mutual coherence assumption on the Gram matrix of the design,
the target vector which has few non-zero components is unique. We do not
assume the Lasso or Dantzig estimators to be unique. We establish the
$\ell_\infty$ convergence rate of all the Lasso and Dantzig estimators
simultaneously under two different assumptions on the noise. The rate that we
get improves upon those obtained for the Lasso in the previous works. Then we
show a sign concentration property of all the thresholded Lasso and Dantzig
estimators simultaneously for a proper choice of the threshold if we assume
that the non-zero components of the sparse target vector are large enough. Our
condition on the size of the non-zero components of the target vector is less
restrictive than in [20-22]. In addition, we prove analogous results for the
Dantzig estimator, which to our knowledge was not done before.

The paper is organized as follows. In Section 2 we present the Gaussian linear
regression model, the assumptions, the results and we compare them with the
existing results in the literature. In Section 3 we consider a general noise
with zero mean and finite variance and we show that the results remain
essentially the same, up to a slight modification of the convergence rate. In
Section 4 we provide the proofs of the results.

2. Model and Results 

Consider the linear regression model

$$Y = X\theta^* + W, \tag{1}$$

where $X$ is an $n \times M$ deterministic matrix, $\theta^* \in \mathbb{R}^M$
and $W = (W_1, \dots, W_n)^T$ is a zero-mean random vector such that
$E[W_i^2] \le \sigma^2$, $1 \le i \le n$, for some $\sigma^2 > 0$.
For any $\theta \in \mathbb{R}^M$, define $J(\theta) = \{j : \theta_j \ne 0\}$.
Let $M(\theta) = |J(\theta)|$ be the cardinality of $J(\theta)$ and
$\mathrm{sign}(\theta) = (\mathrm{sign}(\theta_1), \dots, \mathrm{sign}(\theta_M))^T$
where

$$\mathrm{sign}(t) = \begin{cases} 1 & \text{if } t > 0, \\ 0 & \text{if } t = 0, \\ -1 & \text{if } t < 0. \end{cases}$$

For any vector $\theta \in \mathbb{R}^M$ and any subset $J$ of $\{1, \dots, M\}$,
we denote by $\theta_J$ the vector in $\mathbb{R}^M$ which has the same
coordinates as $\theta$ on $J$ and zero coordinates on the complement $J^c$ of
$J$. For any integers $1 \le d, p < \infty$ and $z = (z_1, \dots, z_d) \in
\mathbb{R}^d$, the $\ell_p$ norm of the vector $z$ is denoted by

$$|z|_p = \Big(\sum_{j=1}^d |z_j|^p\Big)^{1/p}, \qquad \text{and} \qquad |z|_\infty = \max_{1 \le j \le d} |z_j|.$$

Note that the assumption of uniqueness of $\theta^*$ is not satisfied if
$M > n$. In this case, if a vector $\theta^0$ satisfies (1), then there exists
an affine space $\Theta^* = \{\theta^* : X\theta^* = X\theta^0\}$ of dimension
$\ge M - n$ of vectors satisfying (1). So the question of sign consistency
becomes problematic when $M > n$ because we can easily find two distinct
vectors $\theta^1$ and $\theta^2$ satisfying (1) such that
$\mathrm{sign}(\theta^1) \ne \mathrm{sign}(\theta^2)$. However we will show that
under our assumption of sparsity $\theta^*$ is unique.
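
The identifiability issue described above can be made concrete with a tiny example. The following is only an illustrative sketch: the design `X` and the vectors `theta_1`, `theta_2` are hypothetical and chosen for simplicity (one observation, two identical columns). Both vectors produce the same regression function but have different sign vectors.

```python
def matvec(X, theta):
    """Multiply a matrix X (given as a list of rows) by the vector theta."""
    return [sum(x * t for x, t in zip(row, theta)) for row in X]


def sign(t):
    """Sign function with the convention sign(0) = 0."""
    return (t > 0) - (t < 0)


def sign_vector(theta):
    """Apply sign() to every coordinate of theta."""
    return [sign(t) for t in theta]


# Hypothetical design with n = 1 observation and M = 2 identical columns.
X = [[1.0, 1.0]]
theta_1 = [1.0, 0.0]
theta_2 = [0.0, 1.0]

# Both vectors give the same X theta, so the data cannot tell them apart.
same_fit = matvec(X, theta_1) == matvec(X, theta_2)
signs_1 = sign_vector(theta_1)
signs_2 = sign_vector(theta_2)
print(same_fit, signs_1, signs_2)
```

Note that both vectors in this toy example have a single non-zero component, so sparsity alone does not restore uniqueness here: the two columns of `X` are identical, which is exactly the kind of design excluded by a small mutual coherence.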

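The Lasso and Dantzig estimators $\hat\theta^L$, $\hat\theta^D$ solve
respectively the minimization problems

$$\min_{\theta \in \mathbb{R}^M} \frac{1}{n}|Y - X\theta|_2^2 + 2r|\theta|_1, \tag{2}$$

and

$$\min_{\theta \in \mathbb{R}^M} |\theta|_1 \quad \text{subject to} \quad \Big|\frac{1}{n}X^T(Y - X\theta)\Big|_\infty \le r, \tag{3}$$

where $r > 0$ is a constant. A convenient choice in our context will be
$r = A\sigma\sqrt{(\log M)/n}$, for some $A > 0$. We denote respectively by
$\hat\Theta^L$ and $\hat\Theta^D$ the set of solutions to the Lasso and Dantzig
minimization problems (2) and (3).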
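
To make the two criteria concrete, here is a minimal sketch that evaluates the Lasso objective of (2), the left-hand side of the Dantzig constraint in (3), and the tuning parameter $r = A\sigma\sqrt{(\log M)/n}$. The function names and the small design below are illustrative assumptions, not part of the paper.

```python
import math


def tuning_parameter(A, sigma, M, n):
    """Return r = A * sigma * sqrt(log(M) / n)."""
    return A * sigma * math.sqrt(math.log(M) / n)


def residual(X, Y, theta):
    """Return the residual vector Y - X theta (X is a list of rows)."""
    return [y - sum(x * t for x, t in zip(row, theta)) for row, y in zip(X, Y)]


def lasso_objective(X, Y, theta, r):
    """Return (1/n) |Y - X theta|_2^2 + 2 r |theta|_1, the criterion in (2)."""
    n = len(Y)
    res = residual(X, Y, theta)
    return sum(e * e for e in res) / n + 2.0 * r * sum(abs(t) for t in theta)


def dantzig_constraint_value(X, Y, theta):
    """Return |X^T (Y - X theta) / n|_inf, the left-hand side of (3)."""
    n, M = len(Y), len(theta)
    res = residual(X, Y, theta)
    return max(abs(sum(X[i][j] * res[i] for i in range(n))) / n for j in range(M))


# Hypothetical toy data: n = 2, M = 2.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1.0, 0.0]
theta = [0.0, 0.0]
objective = lasso_objective(X, Y, theta, r=0.5)
constraint = dantzig_constraint_value(X, Y, theta)
feasible = constraint <= 0.5          # theta is feasible for (3) when r = 0.5
r_example = tuning_parameter(A=3.0, sigma=1.0, M=100, n=400)
```

In this toy case the zero vector has objective value 0.5 and sits exactly on the boundary of the Dantzig constraint for `r = 0.5`.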

The definition of the Lasso minimization problem we use here is not the same
as the one in [17], where it is defined as

$$\min_{\theta \in \mathbb{R}^M} \frac{1}{n}|Y - X\theta|_2^2 \quad \text{subject to} \quad |\theta|_1 \le t,$$

for some $t > 0$. However these minimization problems are strongly related.
The Dantzig estimator was introduced and studied by Candes and Tao. Define
$\Phi(\theta) = \frac{1}{n}|Y - X\theta|_2^2 + 2r|\theta|_1$. A necessary and
sufficient condition for a vector $\theta$ to minimize $\Phi$ is that the zero
vector in $\mathbb{R}^M$ belongs to the subdifferential of $\Phi$ at point
$\theta$, i.e.,

$$\begin{cases} \frac{1}{n}\big(X^T(Y - X\theta)\big)_j = r\,\mathrm{sign}(\theta_j) & \text{if } \theta_j \ne 0, \\ \big|\frac{1}{n}\big(X^T(Y - X\theta)\big)_j\big| \le r & \text{if } \theta_j = 0. \end{cases}$$

Thus, any vector $\hat\theta \in \hat\Theta^L$ satisfies the Dantzig constraint

$$\Big|\frac{1}{n}X^T(Y - X\hat\theta)\Big|_\infty \le r. \tag{4}$$



The Lasso estimator is unique if $M < n$, since in this case $\Phi(\theta)$ is
strongly convex. However, for $M > n$ it is not necessarily unique. The
uniqueness of the Dantzig estimator is not granted either. From now on, we set
$\hat\Theta = \hat\Theta^L$ or $\hat\Theta^D$ and $\hat\theta$ denotes an
element of $\hat\Theta$.
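
The fact that a Lasso solution satisfies the Dantzig constraint (4) can be checked numerically. The sketch below is not the algorithm of the paper: it uses cyclic coordinate descent with soft thresholding, a standard technique swapped in here for illustration, applied to the criterion $\frac{1}{n}|Y - X\theta|_2^2 + 2r|\theta|_1$. The simulated data, the value of `r` and all function names are assumptions.

```python
import random


def soft_threshold(u, r):
    """Soft-thresholding operator: sign(u) * max(|u| - r, 0)."""
    if u > r:
        return u - r
    if u < -r:
        return u + r
    return 0.0


def lasso_coordinate_descent(X, Y, r, sweeps=500):
    """Minimize (1/n)|Y - X theta|_2^2 + 2 r |theta|_1 by cyclic coordinate descent."""
    n, M = len(X), len(X[0])
    theta = [0.0] * M
    res = list(Y)  # residual Y - X theta for theta = 0
    diag = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(M)]  # Psi_jj
    for _ in range(sweeps):
        for j in range(M):
            if diag[j] == 0.0:
                continue
            # (1/n) X_j^T (residual with coordinate j removed from the fit)
            rho = sum(X[i][j] * res[i] for i in range(n)) / n + diag[j] * theta[j]
            new = soft_threshold(rho, r) / diag[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    res[i] -= X[i][j] * delta
                theta[j] = new
    return theta


def dantzig_gap(X, Y, theta):
    """Return |X^T (Y - X theta) / n|_inf, the left-hand side of (4)."""
    n, M = len(X), len(X[0])
    res = [Y[i] - sum(X[i][j] * theta[j] for j in range(M)) for i in range(n)]
    return max(abs(sum(X[i][j] * res[i] for i in range(n))) / n for j in range(M))


# Hypothetical simulated data: n = 40 observations, M = 6 parameters.
rng = random.Random(0)
n, M = 40, 6
theta_star = [2.0, 0.0, 0.0, -1.5, 0.0, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(n)]
Y = [sum(X[i][j] * theta_star[j] for j in range(M)) + 0.1 * rng.gauss(0.0, 1.0)
     for i in range(n)]
r = 0.2
theta_hat = lasso_coordinate_descent(X, Y, r)
gap = dantzig_gap(X, Y, theta_hat)  # should not exceed r, up to numerical error
```

The coordinate update follows from the optimality condition stated above: setting the $j$-th component of the subgradient to zero gives $\theta_j = S(\rho_j, r)/\Psi_{j,j}$, where $S$ is the soft-thresholding operator. Since $M < n$ in this example, the minimizer is expected to be unique; for $M > n$ the routine would return one element of the solution set.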

Now we state the assumptions on our model. The first assumption concerns 
the noise variables. 

Assumption 1. The random variables $W_1, \dots, W_n$ are i.i.d.
$\mathcal{N}(0, \sigma^2)$.

We also need assumptions on the Gram matrix

$$\Psi \triangleq \frac{1}{n}X^T X.$$

Assumption 2. The elements $\Psi_{i,j}$ of the Gram matrix $\Psi$ satisfy

$$\Psi_{j,j} = 1, \quad \forall\, 1 \le j \le M, \tag{5}$$

and

$$\max_{i \ne j} |\Psi_{i,j}| \le \frac{1}{\alpha(1 + 2c_0)s}, \tag{6}$$

for some integer $s \ge 1$ and some constant $\alpha > 1$, where $c_0 = 1$ if we
consider the Dantzig estimator, and $c_0 = 3$ if we consider the Lasso
estimator.
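
Assumption 2 is straightforward to verify for a given design. The sketch below computes the Gram matrix, its mutual coherence, and checks conditions (5) and (6); the function names, the tolerance `tol` and the two small designs are illustrative assumptions.

```python
def gram_matrix(X):
    """Return Psi = X^T X / n for a design X given as a list of n rows."""
    n, M = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(M)]
            for a in range(M)]


def mutual_coherence(Psi):
    """Return the largest absolute off-diagonal entry of Psi (requires M >= 2)."""
    M = len(Psi)
    return max(abs(Psi[a][b]) for a in range(M) for b in range(M) if a != b)


def satisfies_assumption_2(X, s, alpha, c0, tol=1e-12):
    """Check (5) (unit diagonal) and (6) (coherence at most 1/(alpha(1+2 c0)s))."""
    Psi = gram_matrix(X)
    unit_diagonal = all(abs(Psi[j][j] - 1.0) <= tol for j in range(len(Psi)))
    bound = 1.0 / (alpha * (1.0 + 2.0 * c0) * s)
    return unit_diagonal and mutual_coherence(Psi) <= bound


# Hypothetical designs with n = 4 and M = 2.
X_orthogonal = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
X_correlated = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

ok_orthogonal = satisfies_assumption_2(X_orthogonal, s=1, alpha=1.5, c0=1)
ok_correlated = satisfies_assumption_2(X_correlated, s=1, alpha=1.5, c0=1)
coherence_correlated = mutual_coherence(gram_matrix(X_correlated))
```

The first design has orthogonal columns (coherence 0) and passes; the second has coherence 0.5, which exceeds the bound $1/(1.5 \cdot 3 \cdot 1)$ and fails.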

The notion of mutual coherence was introduced in earlier work where the authors
required that $\max_{i \ne j} |\Psi_{i,j}|$ be sufficiently small. Assumption 2
has also been stated in a slightly weaker form in the literature.

Consider two vectors $\theta^1$ and $\theta^2$ satisfying (1) such that
$M(\theta^1) \le s$ and $M(\theta^2) \le s$. Denote
$\theta = \theta^1 - \theta^2$ and $J = J(\theta^1) \cup J(\theta^2)$. We
clearly have $X\theta = 0$ and $|J| \le 2s$. Assume that $\theta \ne 0$. Under
Assumption 2, similarly as we derive the inequality (11) in Section 4 below and
using the fact that $|\theta|_1 \le \sqrt{2s}\,|\theta|_2$, we get that

$$\frac{|X\theta|_2^2}{n|\theta|_2^2} \ge 1 - \frac{2}{\alpha(1 + 2c_0)} > 0.$$

This contradicts the fact that $X\theta = 0$. Thus we have
$\theta^1 = \theta^2$. We have proved that under Assumption 2 the vector
$\theta^*$ satisfying (1) with $M(\theta^*) \le s$ is unique.

Our first result concerns the $\ell_\infty$ rate of convergence of Lasso and
Dantzig estimators.

Theorem 1. Take $r = A\sigma\sqrt{(\log M)/n}$ and $A > 2\sqrt{2}$. Let
Assumptions 1 and 2 be satisfied. If $M(\theta^*) \le s$, then

$$P\Big(\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le c_2 r\Big) \ge 1 - M^{1 - A^2/8},$$

with $c_2 = \frac{3}{2}\Big(1 + \frac{(1 + c_0)^2}{(1 + 2c_0)(\alpha - 1)}\Big)$.

Theorem 1 states that in high dimensions $M$ the set of estimators
$\hat\Theta$ is necessarily well concentrated around the vector $\theta^*$. A
similar phenomenon was already observed in [1], cf. Remark 1, page 21, for
concentration in $\ell_p$ norms, $1 \le p \le 2$. Note that $c_2$ in Theorem 1
is an absolute constant. Using Theorem 1 we can easily prove the consistency of
the Lasso and Dantzig estimators simultaneously when $n \to \infty$. We allow
the quantities $s$, $M$, $\hat\Theta$, $\theta^*$ to vary with $n$. In
particular, we assume that

$$M \to \infty \quad \text{and} \quad \lim_{n \to \infty} \frac{\log M}{n} = 0,$$

as $n \to \infty$, and that Assumptions 1 and 2 hold true for any $n$. Then we
have

$$\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \to 0 \tag{7}$$

in probability, as $n \to \infty$. The condition $(\log M)/n \to 0$ means that
the number of parameters cannot grow arbitrarily fast when $n \to \infty$. We
have the restriction $M = o(\exp(n))$, which is natural in this context.
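
To get a feeling for the quantities in Theorem 1, the following sketch evaluates the constant $c_2$, using the expression that appears in the last display of the proof of Theorem 1 in Section 4, together with the bound $c_2 r$ and the probability lower bound $1 - M^{1 - A^2/8}$. The function names and the numerical values of $A$, $\alpha$, $M$ are illustrative assumptions.

```python
import math


def c2_constant(alpha, c0):
    """Return (3/2) * (1 + (1 + c0)^2 / ((1 + 2 c0)(alpha - 1)))."""
    return 1.5 * (1.0 + (1.0 + c0) ** 2 / ((1.0 + 2.0 * c0) * (alpha - 1.0)))


def sup_norm_bound(A, sigma, M, n, alpha, c0):
    """Return c2 * r with r = A * sigma * sqrt(log(M) / n)."""
    r = A * sigma * math.sqrt(math.log(M) / n)
    return c2_constant(alpha, c0) * r


def probability_lower_bound(A, M):
    """Return 1 - M^(1 - A^2 / 8); informative only when A > 2 * sqrt(2)."""
    return 1.0 - M ** (1.0 - A * A / 8.0)


c2_dantzig = c2_constant(alpha=2.0, c0=1)   # c0 = 1 for the Dantzig estimator
c2_lasso = c2_constant(alpha=2.0, c0=3)     # c0 = 3 for the Lasso
prob = probability_lower_bound(A=4.0, M=1000)
bound_small_n = sup_norm_bound(4.0, 1.0, 1000, 500, 2.0, 3)
bound_large_n = sup_norm_bound(4.0, 1.0, 1000, 50000, 2.0, 3)
```

With $\alpha = 2$ the constant equals 3.5 for the Dantzig estimator and 69/14 for the Lasso, and the bound $c_2 r$ shrinks as $n$ grows with $M$ fixed, in line with the requirement $(\log M)/n \to 0$.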

A result on $\ell_\infty$ consistency of the Lasso has been previously stated in
Theorem 3 of [21], where $\hat\theta^L$ was assumed to be unique and under
another assumption on the matrix $\Psi$. It is not directly related to our
Assumption 2 but can be deduced from a restricted version of Assumption 2 where
$\alpha$ is taken to be substantially larger than 1. The result in [21] is a
trivial consequence of the $\ell_2$ consistency, and has therefore the rate
$|\hat\theta^L - \theta^*|_\infty = O_P(s^{1/2} r)$ which is slower than the
correct rate given in Theorem 1. In fact, the rate in [21] depends on the
unknown sparsity $s$ which is not the case in Theorem 1. Note also that
Theorem 3 in [21] concerns the Lasso only, whereas our result covers
simultaneously the Lasso and Dantzig estimators.

We now study the sign consistency. We make the following assumption. 

Assumption 3. There exists an absolute constant $c_1 > 0$ such that

$$\rho = \min_{j \in J(\theta^*)} |\theta_j^*| > c_1 r.$$

We will take $r = A\sigma\sqrt{(\log M)/n}$. We can find similar assumptions on
$\rho$ in the work on sign consistency of the Lasso estimator mentioned above.
More precisely, the lower bound on $\rho$ is of the order $s^{1/4} r^{1/2}$ in
[14], $n^{-\delta/2}$ with $0 < \delta < 1$ in [13; 22], and $\sqrt{s}\,r$ in
[21]; a lower bound of the order $\sqrt{(\log Mn)/n}$ has also been used. Note
that our assumption is the least restrictive.

We now introduce thresholded Lasso and Dantzig estimators. For any
$\hat\theta \in \hat\Theta$ the associated thresholded estimator
$\tilde\theta \in \mathbb{R}^M$ is defined by

$$\tilde\theta_j = \begin{cases} \hat\theta_j & \text{if } |\hat\theta_j| > c_2 r, \\ 0 & \text{elsewhere.} \end{cases}$$

Denote by $\tilde\Theta$ the set of all such $\tilde\theta$. We have first the
following non-asymptotic result that we call sign concentration property.

Theorem 2. Take $r = A\sigma\sqrt{(\log M)/n}$ and $A > 2\sqrt{2}$. Let
Assumptions 1-3 be satisfied. We assume furthermore that $c_1 > 2c_2$, where
$c_2$ is defined in Theorem 1. Then

$$P\big(\mathrm{sign}(\tilde\theta) = \mathrm{sign}(\theta^*),\ \forall \tilde\theta \in \tilde\Theta\big) \ge 1 - M^{1 - A^2/8}.$$
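
The thresholding step and the sign vector it produces can be sketched in a few lines. This is only an illustration: the threshold level $c_2 r$ is taken from the proof of Theorem 2 in Section 4, and the estimate `theta_hat` and the values of `c2` and `r` are hypothetical.

```python
def threshold_estimator(theta_hat, c2, r):
    """Keep the coordinates with |theta_hat_j| > c2 * r and set the others to 0."""
    cut = c2 * r
    return [t if abs(t) > cut else 0.0 for t in theta_hat]


def sign_vector(theta):
    """Coordinate-wise sign with the convention sign(0) = 0."""
    return [(t > 0) - (t < 0) for t in theta]


# Hypothetical estimate: two large coordinates and two small spurious ones.
theta_hat = [1.9, 0.03, -0.02, -1.4]
theta_tilde = threshold_estimator(theta_hat, c2=3.5, r=0.05)  # threshold 0.175
signs = sign_vector(theta_tilde)
```

Before thresholding, the sign vector of `theta_hat` has no zero entries; after thresholding, the small coordinates are set to zero and only the signs of the two large coordinates remain.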

Theorem 2 guarantees that every vector $\tilde\theta \in \tilde\Theta$ and
$\theta^*$ share the same signs with high probability. Letting $n$ and $M$ tend
to $\infty$ we can deduce from Theorem 2 an asymptotic result under the
following additional assumption.

Assumption 4. We have $M \to \infty$ and
$\lim_{n \to \infty} \frac{\log M}{n} = 0$, as $n \to \infty$.

Then the following asymptotic result called sign consistency follows
immediately from Theorem 2.

Corollary 1. Let the assumptions of Theorem 2 hold for any $n$ large enough.
Let Assumption 4 be satisfied. Then

$$P\big(\mathrm{sign}(\tilde\theta) = \mathrm{sign}(\theta^*),\ \forall \tilde\theta \in \tilde\Theta\big) \to 1$$

as $n \to \infty$.



The sign consistency of the Lasso was proved in [13; 22] with the Strong
Irrepresentable Condition on the matrix $\Psi$ which is somewhat different from
ours. These papers assume a lower bound on $\rho$ of the order $n^{-\delta/2}$
with $0 < \delta < 1$, whereas our Assumption 3 is less restrictive. Note also
that they assume $\hat\theta^L$ to be unique. Wainwright [20] does not assume
$\hat\theta^L$ to be unique and discusses sign consistency of the Lasso under a
mutual coherence assumption on the matrix $\Psi$ and the following condition on
the lower bound: $\sqrt{(\log M)/n} = o(\rho)$ as $n \to \infty$, which is more
restrictive than our Assumption 3. In particular Proposition 1 in [20] states
that as $n \to \infty$, if the sequence of $\theta^*$ satisfies the above
condition for all $n$ large enough, then

$$P\big[\exists\, \hat\theta^L \in \hat\Theta^L \text{ s.t. } \mathrm{sign}(\hat\theta^L) = \mathrm{sign}(\theta^*)\big] \to 1.$$

This result does not guarantee sign consistency for all the estimators
$\hat\theta^L \in \hat\Theta^L$ but only for some unspecified subsequence that
is not necessarily the one chosen in practice. On the contrary, Corollary 1
guarantees that all the thresholded Lasso and Dantzig estimators and
$\theta^*$ share the same sign vector asymptotically. It follows from this
result that any solution selected by the minimization algorithm is covered and
that the case $M > n$, where the set $\hat\Theta$ is not necessarily reduced to
a unique estimator, can still be treated. We note also that the papers
mentioned above treat the sign consistency for the Lasso only, whereas we prove
it simultaneously for Lasso and Dantzig estimators. The improvement in the
conditions that we get is probably due to the fact that we consider thresholded
Lasso and Dantzig estimators. In addition note that not only the consistency
results, but also the exact non-asymptotic bounds are provided by Theorems 1
and 2.

3. Convergence rate and sign consistency under a general noise

In the literature on Lasso and Dantzig estimators, the noise is usually assumed
to be Gaussian (see, e.g., [20]) or admitting a finite exponential moment
[2; 14]. The exception is the paper by Zhao and Yu [22] who proved the sign
consistency of the Lasso when the noise admits a finite moment of order $2k$
where $k \ge 1$ is an integer. An interesting question is to determine whether
the results of the previous section remain valid under less restrictive
assumptions on the noise. In this section, we only assume that the random
variables $W_i$, $i = 1, \dots, n$, are independent with zero mean and finite
variance $E[W_i^2] \le \sigma^2$. We show that the results remain similar. We
need the following assumption.

Assumption 5. The matrix $X$ is such that

$$\frac{1}{n}\sum_{i=1}^n \max_{1 \le j \le M} |X_{i,j}|^2 \le c',$$

for a constant $c' > 0$.

For example, if all $X_{i,j}$ are bounded in absolute value by a constant
uniformly in $i, j$, then Assumption 5 is satisfied. The next theorem gives the
rate of convergence of Lasso and Dantzig estimators under a mild noise
assumption.

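Theorem 3. Assume that $W_i$ are independent random variables with
$E[W_i] = 0$, $E[W_i^2] \le \sigma^2$, $i = 1, \dots, n$. Take
$r = \sigma\sqrt{\frac{(\log M)^{1+\delta}}{n}}$, with $\delta > 0$. Let
Assumptions 2 and 5 be satisfied. Then

$$P\Big(\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le c_2 r\Big) \ge 1 - \frac{c}{(\log M)^\delta},$$

where $c_2$ is defined in Theorem 1 and $c > 0$ is a constant depending only on
$c'$.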

Therefore the $\ell_\infty$ convergence rate under the bounded second moment
noise assumption is only slightly slower than the one obtained under the
Gaussian noise assumption and the concentration phenomenon is less pronounced.
If we assume that $\lim_{n \to \infty} (\log M)^{1+\delta}/n = 0$ and that
Assumptions 2, 3 and 5 hold true for any $n$ with
$r = \sigma\sqrt{(\log M)^{1+\delta}/n}$, then the sign consistency of
thresholded Lasso and Dantzig estimators follows from our Theorem 3 similarly
as we have proved Theorem 2 and Corollary 1. Zhao and Yu [22] stated in their
Theorem 3 a result on the sign consistency of the Lasso under the finite
variance assumption on the noise. They assumed $\hat\theta^L$ to be unique and
the matrix $X$ to satisfy the condition
$\max_{1 \le i \le n} \big(\sum_{j=1}^M X_{i,j}^2\big)/n \to 0$, as
$n \to \infty$. This condition is rather strong. It does not hold if $M > n$
and all the $X_{i,j}$ are bounded in absolute value by a constant. In addition,
[22] assumes that the dimension $M = O(n^{\delta})$ with $0 < \delta < 1$,
whereas we only need that $M = o(\exp(n^{1/(1+\delta')}))$ with $\delta' > 0$.
Note also that [22] proves the sign consistency for the Lasso only, whereas we
prove it for thresholded Lasso and Dantzig estimators.

4. Proofs



We begin by stating and proving two preliminary lemmas. The first lemma 
originates from Lemma 1 of Q and Lemma 2 of jjj. 

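Lemma 1. Let Assumption 1 and (5) of Assumption 2 be satisfied. Take
$r = A\sigma\sqrt{(\log M)/n}$. Here $\hat\Theta$ denotes either
$\hat\Theta^L$ or $\hat\Theta^D$. Then we have, on an event of probability at
least $1 - M^{1 - A^2/8}$, that

$$\sup_{\hat\theta \in \hat\Theta} \big|\Psi(\theta^* - \hat\theta)\big|_\infty \le \frac{3r}{2}, \tag{8}$$

and for all $\hat\theta \in \hat\Theta$,

$$|\Delta_{J(\theta^*)^c}|_1 \le c_0 |\Delta_{J(\theta^*)}|_1, \tag{9}$$

where $\Delta = \hat\theta - \theta^*$, $c_0 = 1$ for the Dantzig estimator and
$c_0 = 3$ for the Lasso.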

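Proof. Define the random variables $Z_j = n^{-1}\sum_{i=1}^n X_{i,j} W_i$,
$1 \le j \le M$. Using (5) we get that $Z_j \sim \mathcal{N}(0, \sigma^2/n)$,
$1 \le j \le M$. Define the event

$$A = \bigcap_{j=1}^M \{|Z_j| \le r/2\}.$$

Standard inequalities on the tail of Gaussian variables yield

$$P(A^c) \le M\,P(|Z_1| > r/2) \le M \exp\Big(-\frac{n}{2\sigma^2}\Big(\frac{r}{2}\Big)^2\Big) = M^{1 - A^2/8}.$$

On the event $A$, we have

$$\Big|\frac{1}{n}X^T W\Big|_\infty \le \frac{r}{2}. \tag{10}$$

Any vector $\hat\theta$ in $\hat\Theta^L$ or $\hat\Theta^D$ satisfies the
Dantzig constraint (4). Since
$\Psi(\theta^* - \hat\theta) = \frac{1}{n}X^T(Y - X\hat\theta) - \frac{1}{n}X^T W$,
we have on $A$ that

$$\sup_{\hat\theta \in \hat\Theta} \big|\Psi(\theta^* - \hat\theta)\big|_\infty \le \frac{3r}{2}.$$

Now we prove the second inequality. For any $\hat\theta^D \in \hat\Theta^D$, we
have by definition that $|\hat\theta^D|_1 \le |\theta^*|_1$, thus

$$|\Delta_{J(\theta^*)^c}|_1 \le |\Delta_{J(\theta^*)}|_1.$$

Consider now the Lasso estimators. By definition, we have for any
$\hat\theta^L \in \hat\Theta^L$

$$\frac{1}{n}|Y - X\hat\theta^L|_2^2 + 2r|\hat\theta^L|_1 \le \frac{1}{n}|Y - X\theta^*|_2^2 + 2r|\theta^*|_1.$$

Developing the left hand side of the above inequality, we get

$$2r|\hat\theta^L|_1 \le 2r|\theta^*|_1 + \frac{2}{n}(\hat\theta^L - \theta^*)^T X^T W.$$

On the event $A$, we have for any $\hat\theta^L \in \hat\Theta^L$

$$2|\hat\theta^L|_1 \le 2|\theta^*|_1 + |\hat\theta^L - \theta^*|_1.$$

Adding $|\hat\theta^L - \theta^*|_1$ on both sides, we get

$$|\hat\theta^L - \theta^*|_1 + 2|\hat\theta^L|_1 \le 2|\theta^*|_1 + 2|\hat\theta^L - \theta^*|_1,$$

$$|\hat\theta^L - \theta^*|_1 \le 2\big(|\hat\theta^L - \theta^*|_1 + |\theta^*|_1 - |\hat\theta^L|_1\big).$$

Now we remark that if $j \in J(\theta^*)^c$, then we have
$|\hat\theta_j^L - \theta_j^*| + |\theta_j^*| - |\hat\theta_j^L| = 0$, while for
$j \in J(\theta^*)$ the same quantity is at most
$2|\hat\theta_j^L - \theta_j^*|$. Thus we have on the event $A$ that

$$|\Delta_{J(\theta^*)^c}|_1 + |\Delta_{J(\theta^*)}|_1 = |\Delta|_1 \le 4|\Delta_{J(\theta^*)}|_1,$$

$$|\Delta_{J(\theta^*)^c}|_1 \le 3|\Delta_{J(\theta^*)}|_1,$$

for any $\hat\theta^L \in \hat\Theta^L$. □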
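
The Gaussian tail bound used at the start of the proof, $P(A^c) \le M^{1 - A^2/8}$, can be compared with a simulated frequency. The sketch below is a rough Monte Carlo check under the simplifying assumption that the variables $Z_j$ are drawn independently, as would be the case for an orthogonal design; the union bound itself does not require independence. The function names and the values of `A`, `M`, `n`, `trials` and `seed` are assumptions.

```python
import math
import random


def union_bound(A, M):
    """Return the bound M^(1 - A^2 / 8) on the probability of the complement of A."""
    return M ** (1.0 - A * A / 8.0)


def simulate_failure_frequency(A, M, n, sigma=1.0, trials=2000, seed=0):
    """Estimate P(max_j |Z_j| > r/2) for independent Z_j ~ N(0, sigma^2 / n)."""
    rng = random.Random(seed)
    r = A * sigma * math.sqrt(math.log(M) / n)
    sd = sigma / math.sqrt(n)
    failures = 0
    for _ in range(trials):
        # The event fails as soon as one |Z_j| exceeds r / 2.
        if any(abs(rng.gauss(0.0, sd)) > r / 2.0 for _ in range(M)):
            failures += 1
    return failures / trials


frequency = simulate_failure_frequency(A=3.0, M=50, n=100)
bound = union_bound(A=3.0, M=50)
```

In this setting the simulated frequency is expected to fall well below the bound, which reflects the fact that the union bound combined with the Gaussian tail inequality is not tight.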

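Lemma 2. Let Assumption 2 be satisfied. Then

$$\kappa(s, c_0) = \min_{J \subset \{1, \dots, M\},\, |J| \le s}\ \min_{\Delta \ne 0:\, |\Delta_{J^c}|_1 \le c_0 |\Delta_J|_1} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_J|_2} \ge \sqrt{1 - \frac{1}{\alpha}} > 0.$$

Proof. For any subset $J$ of $\{1, \dots, M\}$ such that $|J| \le s$ and
$\Delta \in \mathbb{R}^M$ such that $|\Delta_{J^c}|_1 \le c_0 |\Delta_J|_1$, we
have

$$\begin{aligned} \frac{|X\Delta_J|_2^2}{n|\Delta_J|_2^2} &= 1 + \frac{\Delta_J^T(\Psi - I_M)\Delta_J}{|\Delta_J|_2^2} \\ &\ge 1 - \frac{1}{\alpha(1 + 2c_0)s} \sum_{i,j=1}^M \frac{|\Delta_J^{(i)}||\Delta_J^{(j)}|}{|\Delta_J|_2^2} \\ &\ge 1 - \frac{1}{\alpha(1 + 2c_0)s} \frac{|\Delta_J|_1^2}{|\Delta_J|_2^2}, \end{aligned} \tag{11}$$

where we have used Assumption 2 in the second line, $I_M$ denotes the
$M \times M$ identity matrix and
$\Delta_J = (\Delta_J^{(1)}, \dots, \Delta_J^{(M)})$ denotes the components of
the vector $\Delta_J$. This yields

$$\begin{aligned} \frac{|X\Delta|_2^2}{n|\Delta_J|_2^2} &\ge \frac{|X\Delta_J|_2^2}{n|\Delta_J|_2^2} + 2\frac{\Delta_J^T X^T X \Delta_{J^c}}{n|\Delta_J|_2^2} \\ &\ge 1 - \frac{1}{\alpha s(1 + 2c_0)} \frac{|\Delta_J|_1^2}{|\Delta_J|_2^2} - \frac{2}{\alpha s(1 + 2c_0)} \frac{|\Delta_J|_1 |\Delta_{J^c}|_1}{|\Delta_J|_2^2} \\ &\ge 1 - \frac{1}{\alpha s} \frac{|\Delta_J|_1^2}{|\Delta_J|_2^2} \\ &\ge 1 - \frac{1}{\alpha} > 0. \end{aligned}$$

We have used Assumption 2 in the second line, the inequality
$|\Delta_{J^c}|_1 \le c_0 |\Delta_J|_1$ in the third line and the fact that
$|\Delta_J|_1 \le \sqrt{|J|}\,|\Delta_J|_2 \le \sqrt{s}\,|\Delta_J|_2$ in the
last line. □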

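Proof of Theorem 1. For all $1 \le j \le M$, $\hat\theta \in \hat\Theta$ we have

$$\big(\Psi(\theta^* - \hat\theta)\big)_j = (\theta_j^* - \hat\theta_j) + \sum_{i=1,\, i \ne j}^M \Psi_{i,j}(\theta_i^* - \hat\theta_i).$$

Assumption 2 yields

$$\big|\big(\Psi(\theta^* - \hat\theta)\big)_j - (\theta_j^* - \hat\theta_j)\big| \le \frac{1}{\alpha(1 + 2c_0)s} \sum_{i=1,\, i \ne j}^M |\theta_i^* - \hat\theta_i|.$$

Thus we have

$$|\hat\theta - \theta^*|_\infty \le \big|\Psi(\theta^* - \hat\theta)\big|_\infty + \frac{1}{\alpha(1 + 2c_0)s} |\hat\theta - \theta^*|_1. \tag{12}$$

Set $\Delta = \hat\theta - \theta^*$. Lemma 1 yields that on an event $A$ of
probability at least $1 - M^{1 - A^2/8}$ we have for any
$\hat\theta \in \hat\Theta$

$$|\Psi\Delta|_\infty \le \frac{3r}{2}, \tag{13}$$

and

$$|\Delta|_1 = |\Delta_{J(\theta^*)^c}|_1 + |\Delta_{J(\theta^*)}|_1 \le (1 + c_0)|\Delta_{J(\theta^*)}|_1 \le (1 + c_0)\sqrt{s}\,|\Delta_{J(\theta^*)}|_2.$$

Thus we have, on the same event $A$,

$$\frac{1}{n}|X\Delta|_2^2 = \Delta^T \Psi \Delta \le |\Psi\Delta|_\infty |\Delta|_1 \le \frac{3r}{2}(1 + c_0)\sqrt{s}\,|\Delta_{J(\theta^*)}|_2, \tag{14}$$

for any $\hat\theta \in \hat\Theta$. Lemma 2 yields

$$\frac{1}{n}|X\Delta|_2^2 \ge \Big(1 - \frac{1}{\alpha}\Big)|\Delta_{J(\theta^*)}|_2^2, \tag{15}$$

for any $\hat\theta \in \hat\Theta$. Combining (14) and (15), we obtain that

$$|\Delta|_1 \le \frac{3}{2}\, r\,(1 + c_0)^2 \frac{\alpha}{\alpha - 1}\, s, \tag{16}$$

for any $\hat\theta \in \hat\Theta$. Combining (12), (13) and (16) we obtain that

$$\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le \frac{3}{2}\Big(1 + \frac{(1 + c_0)^2}{(1 + 2c_0)(\alpha - 1)}\Big) r. \qquad □$$

Proof of Theorem 2. Theorem 1 yields
$\sup_{\hat\theta \in \hat\Theta} |\hat\theta - \theta^*|_\infty \le c_2 r$ on an
event $A$ of probability at least $1 - M^{1 - A^2/8}$. Take
$\hat\theta \in \hat\Theta$. For $j \in J(\theta^*)^c$, we have
$\theta_j^* = 0$, and $|\hat\theta_j| \le c_2 r$ on $A$. For
$j \in J(\theta^*)$, we have by Assumption 3 that $|\theta_j^*| > c_1 r$ and
$|\theta_j^*| - |\hat\theta_j| \le |\theta_j^* - \hat\theta_j| \le c_2 r$ on $A$.
Since we assume that $c_1 > 2c_2$, we have on $A$ that
$|\hat\theta_j| \ge (c_1 - c_2) r > c_2 r$. Thus on the event $A$ we have:
$j \in J(\theta^*) \Leftrightarrow |\hat\theta_j| > c_2 r$. This yields
$\mathrm{sign}(\tilde\theta_j) = \mathrm{sign}(\hat\theta_j) = \mathrm{sign}(\theta_j^*)$
if $j \in J(\theta^*)$ on the event $A$. If $j \notin J(\theta^*)$,
$\mathrm{sign}(\theta_j^*) = 0$ and $\tilde\theta_j = 0$ on $A$, so that
$\mathrm{sign}(\tilde\theta_j) = 0$. The same reasoning holds true
simultaneously for all $\hat\theta \in \hat\Theta$ on the event $A$. Thus we get
the result. □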

Proof of Theorem 3. The proof of Theorem 3 is similar to the one of Theorem 1
up to a modification of the bound on $P(A^c)$ in Lemma 1. Recall that
$Z_j = n^{-1}\sum_{i=1}^n X_{i,j} W_i$, $1 \le j \le M$, and the event $A$ is
defined by

$$A = \bigcap_{j=1}^M \{|Z_j| \le r/2\} = \Big\{\max_{1 \le j \le M} |Z_j| \le r/2\Big\}.$$

The Markov inequality yields that

$$P(A^c) \le \frac{4E\big[\max_{1 \le j \le M} Z_j^2\big]}{r^2}.$$

Then we use Lemma 3 given below with $p = \infty$ and the random vectors
$Y_i = (X_{i,1} W_i/n, \dots, X_{i,M} W_i/n) \in \mathbb{R}^M$,
$i = 1, \dots, n$. We get that

$$P(A^c) \le \tilde{c}\, \frac{(\log M)\,\sigma^2}{n^2 r^2} \sum_{i=1}^n \max_{1 \le j \le M} X_{i,j}^2,$$

where $\tilde{c} > 0$ is an absolute constant. Taking
$r = \sigma\sqrt{(\log M)^{1+\delta}/n}$ and using Assumption 5 yields that

$$P(A^c) \le \frac{c}{(\log M)^\delta},$$

where $c > 0$ is an absolute constant. □

The following result is Lemma 5.2.2, page 188 of [15].

Lemma 3. Let $Y_1, \dots, Y_n \in \mathbb{R}^M$ be independent random vectors
with zero means and finite variance, and let $M \ge 3$. Then for every
$p \in [2, \infty]$, we have

$$E\Big[\Big|\sum_{i=1}^n Y_i\Big|_p^2\Big] \le c\,\min[p, \log M] \sum_{i=1}^n E\big[|Y_i|_p^2\big],$$

where $c > 0$ is an absolute constant.